Module 2.3: Distributions
Old Dominion University
So far, we have characterized certain “moments” of data (i.e., mean and variance).
While these two numbers are helpful, they do not tell the full story. As an example, let’s consider the following data:
Summary Statistics for x1. Mean: 9 ; Variance: 11
Summary Statistics for x4. Mean: 9 ; Variance: 11
Now, check out the underlying data:
Clearly, even though these data have the same mean and variance, they have very different distributions. This is different from the case of Michael Jordan and UNC. There, the distributions were in fact very similar except for Michael Jordan’s outlier salary.
Before we can keep talking about distributions, we need to discuss probability briefly.
Probability
A value between 0 and 1 inclusive that represents the likelihood a particular event happens.
The probability that a coin is flipped and it lands on tails is \(\frac{1}{2}\) or \(0.5\) or \(50\%\). What about the probabilities and outcomes of flipping two coins and counting the number of tails? Well, since the number of tails is a random variable, it has a distribution. That distribution is defined by all possible outcomes and their associated probabilities:
| Number of Tails, \(y\) | Probability, \(P(y)\) |
|---|---|
| 0 | 0.25 |
| 1 | 0.50 |
| 2 | 0.25 |
All distributions have a mean and a variance. So, how do we find the mean and variance of a probability distribution?
\[\mu = \sum_{i = 1}^g P(y_i) \times y_i\]
\[\sigma^2 = \sum_{i = 1}^g P(y_i) \times (y_i - \mu)^2\]
Note: this should look a lot like the formula for weighted average.
To practice, let’s use the previous distribution of flipping two coins.
\[\mu = \sum_{i = 1}^g P(y_i) \times y_i = (0.25\times0) + (0.5\times1) + (0.25\times2) = 1\]
\[\begin{aligned}\sigma^2 &= \sum_{i = 1}^g P(y_i) \times (y_i - \mu)^2 \\ &= (0.25\times(0-1)^2) + (0.5\times(1-1)^2) + (0.25\times(2-1)^2) \\ &= (0.25\times(-1)^2) + (0.5\times(0)^2) + (0.25\times(1)^2) \\ &= 0.25 + 0 + 0.25 = 0.5\end{aligned}\]
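These calculations are easy to verify in R. A quick sketch, where the vectors `y` and `p` mirror the two-coin table above:

```r
# Outcomes and probabilities from the two-coin distribution above
y <- c(0, 1, 2)
p <- c(0.25, 0.50, 0.25)

# Mean: probability-weighted sum of outcomes
mu <- sum(p * y)

# Variance: probability-weighted squared deviations from the mean
sigma2 <- sum(p * (y - mu)^2)

mu      # 1
sigma2  # 0.5
```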
Flipping coins is an example of a discrete distribution. In other words, each individual outcome has a non-zero probability of occurring. However, when distributions are continuous, the probability of any specific outcome occurring is precisely zero. What? How can that be true?
Consider the lifespan of a tire on a car. The average tire lasts about 54,000 miles. However, what is the probability that your tire lasts exactly 54,000 miles? Essentially 0%.
This is because your tire could last 54,000.1 miles, 54,000.01 miles, and so on. There are an infinite number of miles that your tire could last, so the probability of any specific mile being the one where your tire blows is zero.
When dealing with continuous probabilities, instead of asking for \(P(y = M)\) (e.g., the probability that your tire lasts exactly \(M\) miles), people ask for \(P(y < M)\), \(P(y > m)\), or \(P(m < y < M)\) (e.g., the probability that your tire lasts more or less than some number of miles, or the probability that your tire lasts between two numbers of miles).
In the interest of being concrete, suppose tires last anywhere between 40,000 and 68,000 miles. The distribution is uniform, meaning that the probability of your tire failing is the same at every mileage from 40,000 to 68,000. A question from a statistics class might be: “what is the probability that your tire lasts more than 60,000 miles?”
To answer this, you need to know what fraction of tires last beyond 60,000 miles. Well, since all tires fail between 40,000 and 68,000 miles, the baseline range is \(68 - 40 = 28\) (in thousands of miles). Then, \(68 - 60 = 8\) is the range we’re interested in. So, the probability would be \(\frac{8}{28} \approx 0.286\).
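R’s `punif()` computes probabilities for a uniform distribution directly, so the hand calculation above can be checked with a one-liner (the 40,000 and 68,000 bounds come from the example):

```r
# P(tire lasts more than 60,000 miles), tires uniform on [40000, 68000]
punif(60000, min = 40000, max = 68000, lower.tail = FALSE)  # 0.2857143

# Equivalent hand calculation: (68 - 60) / (68 - 40)
(68000 - 60000) / (68000 - 40000)  # 0.2857143
```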
The above math highlights how you would deal with probabilities and continuous distributions. However, not all continuous distributions are “uniform” like we assumed above. Perhaps the most famous and widely studied distribution is the normal distribution.
The normal distribution: a symmetric, bell-shaped continuous distribution that is completely described by its mean and standard deviation.
Below is an example of what a normal distribution looks like:
Normal distributions take on different shapes depending on their means and standard deviations.
Below are some examples of what these changes would do to a standard normal distribution:
To learn the most incredible feature of the normal distribution, you’ll have to wait for later in the module. However, while we wait, let’s discuss how to compare two different normal distributions.
Two particularly famous distributions that are normally distributed are scores on the SAT and ACT. The maximum score for each is 1600 and 36, respectively. The SAT has an average score of 1050 with a standard deviation of 200. The ACT has an average score of 19 with a standard deviation of 6.
So, let’s say someone scores a 1,300 on the SAT and a 29 on the ACT. Which test did they do better on?
Well, let’s first visualize the two distributions and the two scores.
Well, this visualization is not helpful. Since the two distributions have different means and standard deviations, we cannot really compare them. A first step to getting a better comparison would be to subtract off the mean from both distributions. Then, both distributions will have a mean of zero.
Still, this is not helpful. Now, the differing spreads are distorting the comparison. To fix this, we can divide each by its respective standard deviation. This way, both distributions have a mean of zero and a standard deviation (and variance) of 1.
Now, it should be quite obvious which score was “better”. The SAT score has a value of 1.25 while the ACT score has a value of about 1.67.
What do these numbers mean? In words, this is the number of standard deviations away from the mean the original score is. This actually has a name, and it is a \(z\)-score. The formula for a \(z\)-score is:
\[z = \frac{y - \mu}{\sigma}\]
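Applying the formula to the two test scores from above:

```r
# z = (y - mu) / sigma
z_sat <- (1300 - 1050) / 200  # SAT: mean 1050, sd 200
z_act <- (29 - 19) / 6        # ACT: mean 19, sd 6

z_sat  # 1.25
z_act  # 1.666667
```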
\(z\)-scores come from what is called a standard normal distribution, which is just a normal distribution with a mean of 0 and standard deviation (and variance) of 1, or \(N(0, 1)\). We know a lot about the standard normal distribution. For example, we know the probability that a random draw from a normal distribution is greater than some number, less than some number, or between two numbers. In most statistics courses, this is where you’d be shown a \(z\)-score table. Rather than spending time on this table, here are some takeaways:
R has some pre-built functions to work with normal distributions.
- pnorm() gives \(P(Z < y)\) when lower.tail = TRUE. When lower.tail = FALSE, this gives \(P(Z > y)\).
  - pnorm(0, lower.tail = TRUE): 0.5
  - pnorm(2, lower.tail = FALSE): 0.0227501
- qnorm() is the opposite of pnorm() in that it converts probabilities into \(z\)-scores.
- rnorm() generates random draws from a normal distribution with a given mean and standard deviation. The default mean is 0 and the default standard deviation is 1.

What about arbitrary distributions that we observe in data? We know how to calculate the mean and variance of a random variable, but it’s helpful to visualize the distribution.
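A few quick calls illustrate how these functions fit together (all values are for the standard normal):

```r
pnorm(0)                      # 0.5: half the distribution lies below the mean
pnorm(2, lower.tail = FALSE)  # ~0.0228: P(Z > 2)

# qnorm() inverts pnorm(): feed it a probability, get back a z-score
qnorm(0.5)    # 0
qnorm(0.975)  # ~1.96, the familiar 95% critical value

# rnorm() draws random values; set a seed for reproducibility
set.seed(311)
rnorm(3)  # three draws from N(0, 1)
```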
If your data is sufficiently discrete, I prefer to use table(). Of course, “sufficiently discrete” is difficult to define. If your data has many unique values, you may need to transform the data via rounding before using table(). Otherwise, you can use hist() to generate a histogram (which is effectively what table() is doing) or density() to create a kernel density plot. You can think of kernel density plots as smoothed out histograms.
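As a small illustration of the rounding idea (the data here are simulated, not from a real dataset):

```r
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)  # 200 hypothetical observations

# Too many unique values for table() to be readable, so round first
table(round(x, -1))  # round(x, -1) rounds to the nearest 10

# Alternatives for continuous data:
hist(x)           # histogram
plot(density(x))  # kernel density: a smoothed-out histogram
```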
Let’s take a look at some real estate transaction data as an example.
# https://vincentarelbundock.github.io/Rdatasets/doc/AER/HousePrices.html
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/AER/HousePrices.csv")
# Keep only columns 2 through 6.
# These columns are the ones that contain numeric data.
# What this step below is doing is sub-setting the columns,
# then saving over the original.
df <- df[,2:6]
head(df)
  price lotsize bedrooms bathrooms stories
1 42000 5850 3 1 2
2 38500 4000 2 1 1
3 49500 3060 3 1 1
4 60500 6650 3 1 2
5 61000 6360 2 1 1
6 66000 4160 3 1 1
First, let’s examine the distribution of bedrooms.
Next, let’s try to plot the distribution of sale price.
This is not a very informative plot as there’s no discernible pattern. Rather, let’s divide by 10,000, round the data, and re-multiply by 10,000.
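The divide-round-remultiply step can be sketched as follows; for a self-contained example, the six prices come from the `head(df)` output above (the same idea applies to the full `df$price` column):

```r
# Sale prices from head(df) above; in practice, use df$price
price <- c(42000, 38500, 49500, 60500, 61000, 66000)

# Divide by 10,000, round to the nearest integer, then re-scale
rounded <- round(price / 10000) * 10000
table(rounded)
```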
This is a much better plot. We can see that there are some outliers over $150,000, and the modal sale price is about $60,000.
Finally, we want a way to tabulate multiple summary statistics for multiple variables. To do this, we can use R’s summary() function.
price lotsize bedrooms bathrooms
Min. : 25000 Min. : 1650 Min. :1.000 Min. :1.000
1st Qu.: 49125 1st Qu.: 3600 1st Qu.:2.000 1st Qu.:1.000
Median : 62000 Median : 4600 Median :3.000 Median :1.000
Mean : 68122 Mean : 5150 Mean :2.965 Mean :1.286
3rd Qu.: 82000 3rd Qu.: 6360 3rd Qu.:3.000 3rd Qu.:2.000
Max. :190000 Max. :16200 Max. :6.000 Max. :4.000
stories
Min. :1.000
1st Qu.:1.000
Median :2.000
Mean :1.808
3rd Qu.:2.000
Max. :4.000
However, summary() isn’t our best option for creating a table. We are going to use our first R package for this task. Using an R package takes two steps. First, you need to install the package (download it from the internet onto your computer), and then load the package (bring it into the current R session). You only need to install a package once, but you need to load it every time you start a new R session.
Below, I am installing the modelsummary package and then loading it into our session. You can find the documentation for modelsummary here. We also need to install kableExtra, but we won’t use it until later.
This package contains a few different functions, but we are going to focus on two: datasummary() and datasummary_skim(). You can find the documentation for datasummary here. To start, let’s see the output of datasummary_skim().
While this provides us with a lot of useful information, it lacks customization. On the other hand, datasummary() provides us with much more versatility. To start, take a look at the below example. We are summarizing price, lotsize, and bedrooms from the dataset df. For each variable, we are going to calculate length (sample size), Mean, SD, Min, and Max.
To further complicate this, we can include many other options. However, for simplicity, we are going to focus on the most important ones. First, we are going to give the table a title. Second, we are going to rename the variables. Take a look:
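A sketch of what such a call might look like using modelsummary’s formula interface, assuming `df` from earlier in this module (the title text and labels here are illustrative, not the exact ones used to produce the table below; see the datasummary documentation for details):

```r
library(modelsummary)

# Backtick-quoted names on the left of `=` relabel each variable;
# the right side of ~ lists the statistics to compute.
datasummary(
  (`Price` = price) + (`Lot Size` = lotsize) + (`Bedrooms` = bedrooms) ~
    N + Mean + SD + Min + Max,
  data  = df,
  title = "Summary Statistics"  # hypothetical title
)
```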
| | N | Mean | St. Dev. | Min | Max |
|---|---|---|---|---|---|
| Price | 546.00 | 68121.60 | 26702.67 | 25000.00 | 190000.00 |
| Lot Size | 546.00 | 5150.27 | 2168.16 | 1650.00 | 16200.00 |
| Bedrooms | 546.00 | 2.97 | 0.74 | 1.00 | 6.00 |
ECON 311: Economics, Causality, and Analytics